WIP: Enable LLVM loop vectorizer #3929
Closed
+22
−9
While there's been some discussion of adding SIMD types in #2299, I thought it might be fun to see how well the LLVM loop vectorizer can do with Julia code. This PR compiles (but doesn't compile sysimg.jl), runs, and vectorizes things (sample), but there are two very big problems with it.
The first issue is that this PR turns all integer `add` operations into `add nsw` in order to get this proof of concept to work. Scalar evolution analysis requires that operations on the loop index produce undefined behavior on overflow, but changing this generally might be too unsafe for a high-level language. Since we should be able to guarantee that `next(::Range{Int})`/`next(::Range1{Int})` doesn't overflow, one option is to add `nsw` intrinsics and use those.

The second issue is that this PR turns `jl_value_t` into `int8*`. Not only does this seem very wrong, it also breaks building `sysimg.jl`, although Julia seems to run fine with a `sysimg.jl` built without this change. Unfortunately, I haven't been able to get the loop vectorizer to work with `jl_value_t` as a structure type. With this change, the IR going into the loop vectorization pass looks like this, whereas without it, the IR looks like this. Notice that, with `jl_value_t` as `i8*`, the bitcast is outside of the loop, whereas with `jl_value_t` as a structure type, it is inside the loop. This seems to bother the loop vectorizer, which tells me:

If I move the bitcast out of the loop and compile the IR manually with `opt`, it seems to work, but I'm a little confused about what makes these cases different.
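
As a rough illustration of the first issue (this is not code from the PR, and both function names are made up for the example), the C sketch below compares the closed-form trip count of a counting loop with the same loop run under wrapping 8-bit arithmetic. The closed form is the kind of result an analysis can rely on only once it may assume the index increment never overflows. The `cap` parameter is an assumption added so that the wrapping version always returns.

```c
#include <stdint.h>

/* Hypothetical helper: closed-form trip count of
 *     for (i = lo; i <= hi; i++)
 * This is only correct if i + 1 never overflows. */
unsigned closed_form_trip_count(uint8_t lo, uint8_t hi) {
    return hi >= lo ? (unsigned)(hi - lo) + 1u : 0u;
}

/* Hypothetical helper: run the same loop with wrapping 8-bit
 * arithmetic and count iterations, giving up after `cap` trips. */
unsigned wrapping_trip_count(uint8_t lo, uint8_t hi, unsigned cap) {
    unsigned trips = 0;
    /* i++ on a uint8_t wraps from 255 back to 0, which is well defined. */
    for (uint8_t i = lo; i <= hi && trips < cap; i++) {
        trips++;
    }
    return trips;
}
```

With bounds 0 and 9 both functions return 10. With bounds 0 and 255 the closed form returns 256, while the wrapping loop never exits on its own and stops only when it reaches `cap`, so the two disagree exactly where the increment wraps.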